LCPA 23-24 - "Hierarchical Mergers of Binary Black Holes"

Group 07

Last Name    First Name   Student Number
----------   ----------   --------------
Bertinelli   Gabriele     2103359
Boccanera    Eugenia      2109310
Cacciola     Martina      2097476
Lovato       Matteo       2104269

1. Introduction

A binary black hole (BBH) can form via close encounters with black holes (BHs) in a dense stellar environment, such as a nuclear star cluster (NSC), a globular cluster (GC) or a young star cluster (YSC). NSCs are very massive (~ $10^5 - 10^8 \, M_\odot$) star clusters lying at the center of some galaxies, including the Milky Way. GCs are old (~ 12 Gyr), massive (~ $10^4 - 10^6 \, M_\odot$) stellar clusters lying in the halo of galaxies. YSCs are young (< 100 Myr) stellar clusters forming mostly in the disk of a galaxy.

Several channels can lead to the formation of BBHs, but the distinctive signature of the dynamical scenario is the formation of hierarchical mergers (HMs), i.e. repeated mergers of stellar-origin BHs that build up more massive ones. This process is possible only in dense star clusters, where the merger remnant, which is initially a single BH, can pair up again via dynamical exchanges or three-body encounters. The main obstacle to the formation of second-generation BHs via hierarchical mergers is the relativistic kick that the merger remnant receives at birth, which can reach several thousand km/s. Hence, the interplay between the properties of the host star cluster (e.g., its escape velocity), those of the first-generation BBH population and the magnitude of the kick determines the maximum mass of a merger remnant in a given environment.

A key implication under study is that HMs can build up intermediate-mass black holes (IMBHs) and partially fill the pair-instability (PI) mass gap between ~60 and ~120 $M_\odot$, explaining the formation of BBHs like GW190521.

Hierarchical mergers

When two stellar-born BHs merge via GW emission, their merger remnant is called second-generation (2g) BH. The 2g BH is a single object at birth. However, if it is retained inside its host star cluster, it may pair up dynamically with another BH. This gives birth to what we call a second-generation (2g) BBH, i.e. a binary black hole that hosts a 2g black hole. If a 2g binary black hole merges again, it gives birth to a third-generation (3g) BH, and so on. In this way, repeated black hole mergers in star clusters can give birth to hierarchical chains of mergers, leading to the formation of more and more massive black holes.

2. Goal of the project

Understand the differences between hierarchical binary black hole mergers in NSCs, GCs and YSCs, by looking at a set of simulated BBHs. Our analysis will be carried out with classification ML algorithms, such as Random Forest and XGBoost. We will then analyze the feature importance to understand the properties of the BBH systems.

The idea is to split the analysis into two parts, one for each classification task: the classification of the host star cluster (label) and the classification of the merger generation (label_ngen).

To do this in a more detailed way, both globally (all labels together) and locally (single label), we will use SHAP values (Section 6).

To have a cleaner notebook we created a file (hmbh.py) containing the functions needed to create the dataset, to train the ML models and to plot the results.

3. Dataset

The dynamical simulations were run for each host star cluster, with 12 different metallicities each. The files are found in the folders and subfolders:

folder = ['GC_chi01_output_noclusterevolv', 'NSC_chi01_output_noclusterevolv',
               'YSC_chi01_output_noclusterevolv']

metallicity = ['0.0002', '0.002', '0.02', '0.0004', '0.004', '0.006', '0.0008',
                   '0.008', '0.0012', '0.012', '0.0016', '0.016']

Each nth_generation.txt dataset is composed of 28 columns. Among them, we selected these:

cols = ['c0', 'c1', 'c2', 'c3', 'c4', 'c7', 'c8', 'c9', 'c13', 'c15', 'c16', 'c17', 'c25', 'c27']

In the creation of the dataset, we added two new columns:

Import modules

3.1 Dataset creation

In order to create the dataset we use the custom function create_dataset, which relies on the polars.DataFrame object. Polars is a fast DataFrame library for manipulating structured data. To be fast, it optimizes queries to reduce unneeded work/memory allocations and adheres to a strict schema (data types should be known before running the query). Polars also leverages the concept of a LazyFrame: instead of running each query line by line, it processes the full query end-to-end. This lazy evaluation strategy allows for better memory management, making it possible to work with datasets that are too large to fit into RAM (a concept similar to the "lazy" evaluation used by Dask).

The speed-up is astonishing: an improvement of ~ 900% with respect to using Pandas!

In the creation of the dataset, we filter out those systems that take longer than the Hubble time (~ 13.6 Gyr) to merge. After creating the dataset, we rename the columns and add the column label_ngen.
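For illustration, here is a minimal sketch of what such a lazy dataset-creation helper could look like. The file layout, separator, merger-time column (and its units) and the label encoding are assumptions, not the actual choices made in hmbh.py:

import polars as pl

HUBBLE_TIME = 13.6  # Gyr (assumed unit of the merger-time column)

def create_dataset(folders, metallicities, cols, time_col="c13"):
    """Hypothetical sketch of create_dataset: lazily scan every file,
    keep the selected columns, drop systems merging after a Hubble time
    and attach the cluster label and the metallicity."""
    frames = []
    for label, folder in enumerate(folders):  # 0: GC, 1: NSC, 2: YSC
        for met in metallicities:
            lf = (
                pl.scan_csv(f"{folder}/{met}/nth_generation.txt", separator=" ")
                .select(cols)
                .filter(pl.col(time_col) < HUBBLE_TIME)
                .with_columns([
                    pl.lit(label).alias("label"),
                    pl.lit(float(met)).alias("metallicity"),
                ])
            )
            frames.append(lf)
    # Nothing has been read yet: the whole query runs only at collect().
    return pl.concat(frames).collect()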

4. Visual description

Let's start by comparing the main features (mass, spin, escape velocity) between the different clusters. The goal is to understand "by eye" how the features are distributed inside our dataset and to look for correlations with the labels. Later, the feature-importance insights from the ML techniques will confirm or refute this first set of interpretations.

4.1 Mass distribution

The plots show the distribution of primary and secondary black hole (BH) masses in 2nd-generation and > 2nd-generation systems, categorized by their hosting bodies: Globular Clusters (GCs), Young Stellar Clusters (YSCs), and Nuclear Star Clusters (NSCs). Each graph is divided based on the metallicity levels 0.0002, 0.002, and 0.02.

  1. Systems of 2nd generation: the median masses of the BHs involved in the dynamical merger are larger in GCs and YSCs than in NSCs. This reflects the high escape velocity of NSCs, which are more likely to retain hierarchical mergers than other star clusters, whereas lighter BHs are ejected by SN kicks from lower-mass clusters. Regarding metallicity, metal-poor stellar clusters retain primary BHs of higher masses because of weaker stellar winds, while the masses decrease inside metal-rich clusters. In the case of GCs, the masses cover approximately the entire range, while YSCs favor dynamical encounters, producing larger masses.

  2. Systems of >2nd generation: the dynamical mergers between the BHs lead to the formation of more massive BHs. We work under the assumption that only the BHs receiving a natal kick lower than the escape velocity have a chance to pair up dynamically. NSCs witness the formation of more low-mass BHs. The masses follow the same behavior as those in 2nd-generation systems.

4.2 Spin distribution

The spin of a first-generation BH traces the angular momentum of its progenitor star. Three-body interactions in hierarchical systems can result in spin-flip events, where the orientation of a black hole's spin changes significantly. The conservation of angular momentum plays a crucial role in determining the spins of the black holes resulting from a merger.

The distribution of the dimensionless spins looks similar in NSCs, GCs and YSCs, by construction, because in the simulation the same spin models have been assumed. Most first-generation mergers have precessing spin squeezed toward low values (∼ 0.1 − 0.2), while nth-generation mergers tend to have high values of $\chi_\text{p}$ ∼ 0.7. This creates a sort of spin gap between ∼ 0.3 and ∼ 0.6.

Different spin magnitudes do not significantly impact the shape of the mass function, the maximum mass and the position of the peak, but they have a strong effect on the number of nth-generation mergers. This effect is particularly important for YSCs and GCs, which have a lower escape velocity than NSCs.

4.3 Mass distribution of the BH remnant resulting from the merger

The plots show the distribution of remnant masses for 2nd and >2nd-generation systems, categorized by their hosting bodies.

  1. Globular Clusters (GC): GCs show a broader distribution of remnant masses at all metallicity levels. This could indicate a wider variety of remnant masses within GCs, possibly due to the dense stellar environment leading to more frequent and diverse interactions.

  2. Nuclear Star Clusters (NSC): NSCs, especially at higher metallicity (0.02), exhibit a concentration towards lower remnant masses. This could be due to stellar wind mass loss being more significant in stars with higher metallicity, leading to the formation of less massive remnants.

  3. Young Stellar Clusters (YSC): YSCs produce more massive remnants because of the environment's exceptional density.

The mass function of nth-generation BBHs peaks at values ∼ 30 − 100 $M_\odot$ in all considered star clusters. The main difference between NSCs, GCs and YSCs is the maximum mass of nth-generation BBHs.

4.4 Spin distribution of BH remnant

In the plot, we show the distribution of the dimensionless spin of the BH remnant for systems of 2nd and >2nd-generation systems, for three different metallicities.

The distribution is more or less the same for the three hosting bodies, so this feature is not useful for distinguishing which cluster a system belongs to. Moreover, looking at the systems of higher generations, the small increase in the values reached is not large enough to tell whether a system has evolved or not.

4.5 Escape velocity distribution

The escape velocity represents a very important feature to distinguish between the different host bodies. In particular, since NSCs are very massive clusters, they are characterized by high values of escape velocities. GCs reach broader intermediate values, while YSCs have the lowest escape velocities.

The same trend is found for both 2nd and >2nd-generation systems.

4.6 Total mass distribution of the stellar cluster

The following plots show the distribution of the total mass of the cluster. The most massive clusters considered are the NSCs, reaching $10^7 M_\odot$. YSCs and GCs settle around lower values, of the order of $10^6 M_\odot$.

The trend remains the same for both systems of 2nd and >2nd generation.

4.7 Number of generations per hosting body

NSCs, because of their high escape velocity, allow a larger number of generations to form, up to primary masses over $\sim10^3 M_\odot$. The main limitation to building even more massive BHs in NSCs is represented by the long timescales: after ≈ 10 generations at most, the simulation reaches the Hubble time. In contrast, the maximum masses in both GCs and YSCs are a few times $10^2 M_\odot$. Another crucial difference between NSCs and either GCs or YSCs is the fraction of nth- to first-generation mergers.

NSCs host up to 10 generations in the fiducial case (and up to 17 in the most optimistic case), while GCs and YSCs typically witness up to 4 − 5 and 3 − 4 generations, respectively. In our case, GCs reach the 5th generation, while YSCs stop at the 3rd generation. This can be seen clearly thanks to the zooms on the distribution.

4.8 BH mass vs Escape velocity

We show the distribution of the masses of the primary and secondary BHs with respect to the escape velocity, one of the parameters identified as relevant in the previous analysis.

Similar behavior is observed for the three different metallicities. The higher escape velocities associated with NSCs (in the range $100 \, \text{km/s} \lesssim \texttt{escape\_vel} \lesssim 400 \, \text{km/s}$) relate to higher BH masses (up to $500 \, M_\odot$).

YSCs and GCs share lower velocities ($ < 100 \, \text{km/s}$). Regarding the masses, the range changes with the metallicity: higher metallicities are associated with lower masses. In general, for these two clusters, the masses are of the order of $100 \, M_\odot$.

4.9 BH remnant mass vs Metallicity

In the following plot, we can see which range of BH remnant masses is reached according to the metallicity and the cluster. In NSCs, the highest masses are concentrated at lower metallicities. YSCs show a similar trend at all metallicities, with masses of the order of $10^2 M_\odot$. For GCs, the systems are almost equally distributed, reaching masses on average in the range $200 \, M_\odot \lesssim \texttt{remnant\_mass} \lesssim 600 \, M_\odot$.

4.10 BH remnant mass vs Escape velocity

In the following plots, we show the distribution of the BH remnant mass with respect to the escape velocity, distinguishing between the three hosting bodies.

NSCs are confirmed to have the highest escape velocities, producing the heaviest remnants (up to $4000 \, M_\odot$). YSCs concentrate at low escape velocities ($< 100 \, \text{km/s}$) and reach masses $< 500 \, M_\odot$. Similar behavior can be found for the GCs: these systems have low escape velocities of the order of $10^2 \, \text{km/s}$, with the BH remnants reaching at most $\approx 10^3 \, M_\odot$.

5. Classification analysis

In this section, we describe the machine learning tasks and their results.

The reader should keep in mind that re-running the notebook cells may yield slightly different values, due to the stochasticity of some of the functions used.

5.1 Data preprocessing

With h.data_preprocessing we prepare the dataset for the training and testing of the ML model.
One important parameter is balanced_label: a classifier benefits from a dataset with balanced labels. In our dataset, very few systems belong to label = 2 (YSC) compared with the other labels. This is a problem if we restrict the dataset to fewer samples than the total (i.e. using the n_sample parameter): on the restricted dataset, the models struggle to generalize over label 2. So, if we decide to restrict our dataset, it is mandatory to balance the labels (and the maximum number of samples will be the number of label-2 samples $\times$ 3).
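A minimal sketch of what such a preprocessing helper with a balanced_label option might do (the function signature, the downsampling strategy and the 75/25 split are assumptions, not the actual implementation in hmbh.py):

import polars as pl
from sklearn.model_selection import train_test_split

def data_preprocessing(df, features, target="label", balanced_label=True,
                       n_sample=None, seed=42):
    """Hypothetical sketch: balance the classes by downsampling, optionally
    restrict the total number of samples, and split into train/test sets."""
    if balanced_label:
        classes = df[target].unique().to_list()
        # Size of the rarest class (label 2, YSC, in our dataset).
        n_min = min(df.filter(pl.col(target) == k).height for k in classes)
        df = pl.concat(
            [df.filter(pl.col(target) == k).sample(n=n_min, seed=seed)
             for k in classes]
        )
    if n_sample is not None and n_sample < df.height:
        df = df.sample(n=n_sample, seed=seed)
    X = df.select(features).to_numpy()
    y = df[target].to_numpy()
    return train_test_split(X, y, test_size=0.25, stratify=y, random_state=seed)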

Below we show what we stated above. We train a simple Random Forest (RF) model three times: on the full dataset without balanced labels, on a restricted dataset without balanced labels, and on a restricted dataset with balanced labels (Sections 5.1.1-5.1.3).

There are several diagnostic tools to assess model performance. In our analysis, we use:

Accuracy
It measures the proportion of correctly classified instances out of the total instances. In other words, accuracy indicates the percentage of predictions that the model got right. Accuracy is commonly used when the classes in the dataset are balanced. It's a simple metric to evaluate overall model performance.

Confusion matrix
The confusion matrix summarizes the predictions made by the model on a dataset by comparing them to the actual labels. The matrix is organized into rows and columns, where each row represents the actual class labels and each column represents the predicted class labels. The main diagonal of the matrix shows the instances that were correctly classified, while the off-diagonal elements represent misclassifications.

ROC and AUC
The Receiver Operating Characteristic (ROC) curve and Area Under the ROC Curve (AUC) are used when classes are imbalanced or when you want to evaluate how well your model can distinguish between different classes.
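For reference, a minimal sketch of how these three diagnostics can be obtained with scikit-learn (the function and variable names are illustrative; for the binary label_ngen task, the positive-class probability column would be passed to roc_auc_score instead):

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def classification_diagnostics(model, X_test, y_test):
    """Accuracy, confusion matrix and one-vs-rest AUC for a fitted classifier."""
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)        # shape: (n_samples, n_classes)
    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)        # rows: true, columns: predicted
    auc = roc_auc_score(y_test, y_proba, multi_class="ovr")
    return acc, cm, auc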

5.1.1 Full dataset w/o balanced labels

Running the model over the entire dataset takes ~5 minutes. We achieve a pretty high accuracy on the test set, ~97%. But if we look at the confusion matrix, which tells us how the model classified the samples, we see that ~50% of the label-2 samples are predicted as label 0. This is due to the fact that the label-2 count is 2 (3) orders of magnitude lower than that of label 0 (1). The accuracy is high, but the AUC for label 2 is lower than the AUC for the remaining labels.
The labels in the complete dataset are unbalanced and lead to poor generalization on unseen data.

The same conclusions can be drawn for the label_ngen task, where the model struggles to classify the positive label (i.e. label 1).

5.1.2 Restricted dataset w/o balanced labels

This time we take 100,000 samples across all the labels, corresponding to ~3% of the dataset. The labels are unbalanced and label 2 is still 2 (3) orders of magnitude below the other two labels. The results are similar to before.

Again, with the label_ngen task, we can see that label 0 is mislabeled ~50% of the time.

The performance results are very similar to the ones above, so we can conclude that the dataset is still too unbalanced to guarantee an optimal classification.

5.1.3 Restricted dataset w/ balanced labels

In this case, whether we downsample or not, the dataset will have $n_{sample} \le 7207\cdot3 = 21621$ samples, which is ~0.66% of the complete dataset.
This time, thanks to the balanced dataset, the labels are classified correctly with good accuracy ($\ge 88\,\%$). The test accuracy is a bit lower than the one in the benchmark above ($\sim 92\%$), but still a good result. The AUC is very close to 1, indicating that the model, even in its simple-as-possible form, is well suited to this task.

Regarding the classification problem with label_ngen, the dataset will have $n_{sample} \le 401335\cdot2 = 802670$ samples, which is ~24% of the complete dataset. We can downsample the dataset to gain some computational speed. The test accuracy is ~79% and the correct predictions are ~87% and ~71% for labels 0 and 1, respectively. The AUC increased slightly, gaining about 3%.
This is an acceptable result, but we have to keep in mind that we are working with a simple model. In the sections below we will see whether a fine-tuned model can lower the number of misclassifications or whether the problem is related to the dataset itself.

For brevity, we do not repeat the same analysis with XGBoost; the results are the same as for the RF.
So, given the results above, we conclude that it is best to work with a balanced dataset in order to generalize better over unseen data, thus improving model evaluation.

We have also seen that we can reduce the number of samples while retaining a good accuracy score. The next step is to understand the optimal size of the sampled dataset.

5.2 Best dataset size

In this section, we propose an analysis to understand which sample size is best without losing too much accuracy. This is done to gain computational speed.
We'll use the plot_learning_curve function. It takes as input the data that we want to analyze and the ML model, and uses the sklearn learning_curve function to calculate the mean training/test scores; the sklearn function splits the input data into training sets of different sizes.
Similar tests are done with a simple XGB model and for label_ngen.
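A minimal sketch of what plot_learning_curve could look like on top of sklearn's learning_curve (the training-size fractions and the 5-fold cross-validation are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, train_sizes=np.linspace(0.1, 1.0, 8)):
    """Hypothetical sketch: mean training/validation accuracy vs. sample size."""
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=train_sizes, cv=5,
        scoring="accuracy", n_jobs=-1)
    plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training score")
    plt.plot(sizes, test_scores.mean(axis=1), "o-", label="cross-validation score")
    plt.xlabel("Number of training samples")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()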

5.2.1 Conclusion

These are the best values found in the tests above, i.e. the sample counts that will be used in the classification tasks.

5.3 Classification tasks

In this section, we will train two kinds of classifiers: a Random Forest classifier and an XGBoost classifier. Both models are described in their respective subsections.
Although the classification results are already good with a simple version of the models, we want to find the best combination of the most important hyperparameters in order to get the best results. We will use the GridSearchCV function from sklearn. Given a grid of values, the function tests each combination of the provided hyperparameters and returns the best one. This combination will be used to train and test the models, hereafter RF_best and XGB_best.

We wanted to test the performance of two similar, yet different, models to have a broader choice before performing the feature-importance analysis. For each classification task, we will choose the model with the highest scores and use its results to analyze the feature importance.

5.3.1 Random Forest Classifier

A random forest is a meta-estimator that fits many decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
It builds multiple decision trees (hence the term "forest"), each based on a random subset of the training data and features. The randomness helps in creating diversity among the trees, reducing the risk of overfitting and improving the model's generalization. For classification tasks, each tree in the forest "votes" for a class, and the class with the most votes becomes the model's prediction. A random forest also provides a measure of feature importance, indicating which features are most influential in making predictions; this can be helpful for feature selection and for understanding the underlying relationships in the data.
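As an illustration of this voting and feature-importance mechanism in scikit-learn (the helper name, the hyperparameter values and the input arrays are assumptions; X_train, y_train and feature_names would come from the preprocessing step above):

from sklearn.ensemble import RandomForestClassifier

def fit_and_rank_features(X_train, y_train, feature_names):
    """Fit a forest and print its impurity-based feature importances."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    # Each tree votes: predict() returns the majority class,
    # predict_proba() the fraction of votes per class.
    for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                            key=lambda t: t[1], reverse=True):
        print(f"{name:>15s}  {imp:.3f}")
    return rf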

To complete these tasks we use the custom function gridsearch_RF, which wraps the sklearn GridSearchCV function.
This time the model_evaluation custom function will also plot the feature-importance histogram. We will use it in the next section.

gridsearch_scores will show the training/test scores belonging to the best combination of the hyperparameters.
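A minimal sketch of what gridsearch_RF could do (the grid values are assumptions, chosen around the best combinations reported below; the actual grid in hmbh.py may differ):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def gridsearch_RF(X_train, y_train, cv=3):
    """Hypothetical sketch of the Random Forest grid-search helper."""
    param_grid = {
        "n_estimators": [100, 200, 300],
        "max_depth": [10, 15, 25, 50],
        "min_samples_split": [2, 10, 100],
    }
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42, n_jobs=-1),
        param_grid, cv=cv, scoring="accuracy",
        n_jobs=-1, return_train_score=True)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, grid.cv_results_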

5.3.1 a) label classification

We exclude the cluster_mass feature because it is an intrinsic property of the host systems, which we want to classify based on the BH parameters alone.
The best hyperparameter combination is max_depth=15, min_samples_split=10, n_estimators=300, where:

- max_depth: the maximum depth of each tree;
- min_samples_split: the minimum number of samples required to split an internal node;
- n_estimators: the number of trees in the forest.

With this combination, the RF model reaches a test accuracy of ~93%. Looking at the confusion matrix, $\ge 88\%$ of the labels are correctly predicted. We can be satisfied with the model and the hyperparameters chosen through GridSearchCV.

The results from the ROC are similar to the ones in Sec. 5.1.3, indicating that even the simplest model can tackle the classification w.r.t. label.

Score            %
--------------   ------
Training score   97.4%
Test score       92.7%

Label   Correct predictions
-----   -------------------
0       88%
1       95%
2       95%

5.3.1 b) label_ngen classification

The best hyperparameter combination is max_depth=50, min_samples_split=100, n_estimators=200 (the hyperparameters are defined above).

With this combination, the RF model reaches a test accuracy of ~80%. Looking at the confusion matrix, the model struggles to classify correctly ~30% of the label=1 samples; even with a fine-tuned model, there is little improvement.
Looking at the AUC, there is a small improvement of ~3% with respect to the test with the simple model.

We will try to understand if it is a problem related to the model or to the dataset itself.

Score            %
--------------   ------
Training score   85.0%
Test score       80.8%

Label   Correct predictions
-----   -------------------
0       87%
1       75%

5.3.2 XGBoost Classifier

XGBoostClassifier is an implementation of the XGBoost algorithm for classification tasks. XGBoost stands for Extreme Gradient Boosting, which is an advanced implementation of gradient boosting algorithms.

XGBoostClassifier is based on the gradient boosting framework, which is an ensemble learning technique where multiple weak learners (typically decision trees) are combined to form a strong learner. It builds trees sequentially, where each subsequent tree corrects the errors made by the previous ones, thus gradually improving the overall model performance.
XGBoost incorporates regularization techniques to prevent overfitting, such as L1 and L2 regularization on the leaf weights and tree complexity.
Similar to RandomForestClassifier, XGBoostClassifier provides feature importance scores, allowing users to identify the most influential features in the model.

To complete these tasks we use the custom function gridsearch_XGB, which wraps the sklearn GridSearchCV function.
This time the model_evaluation custom function will also plot the feature-importance histogram. We will use it in the next section.

gridsearch_scores will show the training/test scores belonging to the best combination of the hyperparameters.
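Analogously, a minimal sketch of gridsearch_XGB (again, the grid values are assumptions built around the best combinations reported below):

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

def gridsearch_XGB(X_train, y_train, cv=3):
    """Hypothetical sketch of the XGBoost grid-search helper."""
    param_grid = {
        "n_estimators": [150, 200],
        "max_depth": [25, 50],
        "learning_rate": [0.1, 0.2],
        "subsample": [0.3, 0.5],
        "colsample_bytree": [0.5, 1.0],
        "gamma": [0.2, 0.5],
        "reg_lambda": [10, 30],
    }
    grid = GridSearchCV(XGBClassifier(), param_grid, cv=cv,
                        scoring="accuracy", n_jobs=-1, return_train_score=True)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, grid.cv_results_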

5.3.2 a) label classification

We exclude the cluster_mass feature because it is an intrinsic property of the host systems, which we want to classify based on the BH parameters alone.
The best hyperparameter combination is colsample_bytree=0.5, gamma=0.2, learning_rate=0.2, max_depth=25, n_estimators=150, reg_lambda=10, subsample=0.5, where:

- colsample_bytree: the fraction of features sampled when building each tree;
- gamma: the minimum loss reduction required to make a further split;
- learning_rate: the shrinkage applied to each tree's contribution;
- max_depth: the maximum depth of each tree;
- n_estimators: the number of boosting rounds (trees);
- reg_lambda: the L2 regularization term on the leaf weights;
- subsample: the fraction of training samples used to grow each tree.

With this combination, the XGB model reaches a test accuracy of ~93%. Looking at the confusion matrix, $\ge 90\%$ of the labels are correctly predicted. We can be satisfied with the model and the hyperparameters chosen through GridSearchCV. The training score of 99.5% is an indication of slight overfitting: the model performs exceptionally well on the training data but does not generalize equally well to unseen data.
However, the ROC shows that XGB_best_lab is an optimal model.

Score            %
--------------   ------
Training score   99.5%
Test score       92.8%

Label   Correct predictions
-----   -------------------
0       90%
1       94%
2       94%

5.3.2 b) label_ngen classification

The best hyperparameter combination is colsample_bytree=0.5, gamma=0.5, learning_rate=0.1, max_depth=50, n_estimators=200, reg_lambda=30, subsample=0.3 (the hyperparameters are defined above).

With this combination, the XGB model reaches a test accuracy of ~80%. In this case the model does not seem to overfit, but it still struggles to classify correctly ~25% of the label=1 samples. Together with the RF model results, this could be an indication of a dataset-related problem.

Score            %
--------------   ------
Training score   86.8%
Test score       80.0%

Label   Correct predictions
-----   -------------------
0       87%
1       73%

5.3.3 Discussion

Since both models give similar results, we choose to use the RandomForestClassifier models RF_best_* for the feature analysis, because they are less complex and easier (i.e. faster) to train.

6. Feature importance analysis

We split our feature-importance analysis into two parts, one for each task: the label classification (Section 6.2) and the label_ngen classification (Section 6.3).

To do so we will visually inspect the results of the SHAP analysis (explained in the next subsection). shap_explainer takes as input the trained model and the background data over which to integrate (i.e. the data used to estimate the feature importance). The parameter n_sample is optional and refers to the number of samples to select; since the model is already trained, anywhere from 100 to 1000 random background samples is a good size to use.
Once the SHAP values are calculated, we can display them in two different kinds of plots, using the custom functions plot_shap_bar and plot_shap_violin. The meaning of the plots is explained in the next subsection.
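A minimal sketch of these helpers on top of the shap library (the exact signatures in hmbh.py may differ; the background data is assumed to be a pandas DataFrame and the model a tree-based classifier):

import shap

def shap_explainer(model, background, n_sample=500, seed=42):
    """Hypothetical sketch: SHAP values of a tree-based model on a random
    subset of the background data."""
    sample = background.sample(n=min(n_sample, len(background)), random_state=seed)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(sample)   # per-class SHAP values
    return shap_values, sample

def plot_shap_bar(shap_values, sample):
    # Global importance: mean absolute SHAP value per feature.
    shap.summary_plot(shap_values, sample, plot_type="bar")

def plot_shap_violin(shap_values_label, sample):
    # Distribution of SHAP values for a single label (violin plot).
    shap.summary_plot(shap_values_label, sample, plot_type="violin")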

6.1 SHAP values

SHAP (SHapley Additive exPlanations) values are a method used to explain the predictions of machine learning models. They were introduced by Lundberg and Lee in 2017 and are based on cooperative game theory.

SHAP values provide a way to allocate the contribution of each feature in a prediction model to the final prediction. They aim to quantify the impact of each feature on the prediction outcome by considering all possible combinations of features and calculating the contribution of each feature in each combination.

The core idea behind SHAP values is the concept of Shapley values, which is a method for fairly distributing the payout of a cooperative game among the players based on their individual contributions. In the context of machine learning models, the features are treated as players, and the prediction outcome is the payout.

To calculate SHAP values, a baseline reference is established, which serves as a starting point for the feature contributions. Then, different combinations of features are created, and for each combination, the model predictions are observed. The contribution of each feature is determined by comparing the predictions with and without the feature included.
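For reference, the Shapley value of feature $i$ (the quantity that SHAP approximates) can be written, for the full feature set $F$ and a model $f_S$ evaluated on a feature subset $S$, as:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}\left[f_{S\cup\{i\}}\!\left(x_{S\cup\{i\}}\right) - f_S\!\left(x_S\right)\right]$$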

SHAP values provide several advantages in interpreting machine learning models. They offer a unified framework for feature importance measurement that is both consistent and locally accurate. They also satisfy desirable properties such as consistency, meaning that if a feature is removed or added, the SHAP values change accordingly. Furthermore, SHAP values can be used to explain individual predictions as well as provide an overview of feature importance across the entire dataset.

In our study, we employed three types of visualizations to gain insights into the importance of each feature.

Summary plot (bar type) for global inspection
This kind of plot examines the mean absolute SHAP value for each feature across all of the data. This quantifies, on average, the magnitude of each feature's contribution towards the predicted label. If a feature has a high mean SHAP value, it suggests that it generally has a strong influence on the predictions. Conversely, a low mean SHAP value indicates that the feature has less impact on the model's predictions. Mean absolute SHAP values are essentially a drop-in replacement for more traditional feature importance measures but have two key advantages: they are consistent (a feature whose contribution to the model increases never receives a lower importance), and they are computed from the model's predictions on the chosen data rather than from the model's internal structure.

Summary plot (bar type) for label inspection
We created separate plots for each label to gain a more accurate understanding of the feature importance. By examining the importance of features for each label individually, we obtained specific insights into their contributions to the prediction accuracy for each class.

Summary plot (violin type) for each label (evolution channel)
This is the most useful type of plot for our study. A SHAP violin plot provides a visual representation of the distribution of SHAP values for different features in the model. It helps in understanding the impact of each feature on the model's predictions.

The SHAP violin plot allows you to compare the distributions of SHAP values across different features. It helps identify features that have consistent and impactful contributions (wider violins with higher densities) and those that have less influence (narrower violins with lower densities).

The plot can also reveal features with bimodal or asymmetric distributions, indicating the presence of different subgroups or distinct patterns in the data that affect the predictions differently.

6.2 label feature importance

For this task we will try to give an astrophysical explanation; although the model classified the systems correctly, we cannot provide a precise explanation for all the features.

We will focus on the top 5 features, both for brevity and because they retain almost the totality of the feature importance.

6.2.1 Summary plot (bar type)

6.2.2 Summary plot (violin type) for each label

Here we try to give an astrophysical interpretation of the results illustrated by the plot above.



6.3 label_ngen feature importance

For this task we will try to give an astrophysical explanation, but since the model struggles to classify correctly $\sim 20\%$ of the dataset, we cannot be sure about the provided explanations.

We will focus on the top 5 features, both for brevity and because they retain almost the totality of the feature importance.

6.3.1 Summary plot (bar type)

6.3.2 Summary plot (violin type) for each label

Here we try to give an astrophysical interpretation of the results illustrated by the plot above.


7. Conclusions and future work

We investigated the differences between hierarchical binary black hole mergers in NSCs, GCs and YSCs, and the efficiency of hierarchical mergers. We considered two different machine learning algorithms: a RandomForestClassifier and an XGBoostClassifier. Both models are based on tree ensembles, but their optimization differs.
Before training the models, we studied which dataset setup would lead to better results, using a simple-as-possible RandomForest model. The benchmarking led us to the conclusion that a balanced (label-wise) dataset is preferable and that we could reach high performance using a reduced dataset, gaining computational efficiency (i.e. faster training).

To retrieve the best model, we performed a grid search over different hyperparameters. The results of the two models for the two tasks were very close, so we decided to use only the RandomForestClassifier model in our final analysis. Both classifiers performed very well on the task concerning the classification of the stellar clusters, reaching both high accuracy and high classification scores. In the task concerning the classification with the label_ngen labeling, both models struggle to classify approximately 25-30% of the label-1 samples, despite the dataset being balanced and the models not exhibiting strong overfitting. There can be multiple causes: for example, the data might contain patterns that are inherently difficult to capture, or the models used might be too complex or too simple for the given task. In future work, other machine learning algorithms may be tried, such as a CNN, which is more flexible in capturing patterns and substructures in a dataset.

We struggled to give an astrophysical interpretation of the features related to the first task: the YSCs are well described, while GCs and NSCs are not. Perhaps with a different machine learning algorithm and a purpose-built dataset we could achieve better and more explanatory results. The features found by analyzing the SHAP values related to the label_ngen task, i.e. assessing the hierarchical merger efficiency, largely correspond to the ones derived with "classical" statistical methods. One possible extension, to study the efficiency of hierarchical mergers in more detail, is to study BH systems in their stellar cluster hosts individually, to gain more insight into feature importance in the different types of stellar clusters.